Adaptive seeds tame genomic sequence comparison.
نویسندگان
چکیده
The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billionbase DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
منابع مشابه
Optimal Spaced Seeds for Homologous Coding Regions
Optimal spaced seeds were developed as a method to increase sensitivity of local alignment programs similar to BLASTN. Such seeds have been used before in the program PatternHunter, and have given improved sensitivity and running time relative to BLASTN in genome-genome comparison. We study the problem of computing optimal spaced seeds for detecting homologous coding regions in unannotated geno...
متن کاملGenotyping-By-Sequencing (GBS) Detects Genetic Structure and Confirms Behavioral QTL in Tame and Aggressive Foxes (Vulpes vulpes)
The silver fox (Vulpes vulpes) offers a novel model for studying the genetics of social behavior and animal domestication. Selection of foxes, separately, for tame and for aggressive behavior has yielded two strains with markedly different, genetically determined, behavioral phenotypes. Tame strain foxes are eager to establish human contact while foxes from the aggressive strain are aggressive ...
متن کاملOptimal Spaced Seeds for Finding Homologous Coding Regions
We study the problem of computing optimal spaced seeds for identifying homologous coding DNA sequences in large genomic data sets. We develop two models of DNA sequence alignment in coding regions, and using data sets from human/Drosophila and human/mouse comparisons, we compute optimal spaced seeds using a dynamic programming algorithm. The seeds we identify are more sensitive by far at identi...
متن کاملIndel seeds for homology search
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguo...
متن کاملCloud-based adaptive exon prediction for DNA analysis
Cloud computing offers significant research and economic benefits to healthcare organisations. Cloud services provide a safe place for storing and managing large amounts of such sensitive data. Under conventional flow of gene information, gene sequence laboratories send out raw and inferred information via Internet to several sequence libraries. DNA sequencing storage costs will be minimised by...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Genome research
دوره 21 3 شماره
صفحات -
تاریخ انتشار 2011